feat(datasets): Phase 1 in-house sets — domain-bilingual-v1 + negatives-ood-v1#7
Merged
…atives-ood-v1) Approved design implementing eval-plan datasets ii/iii: single bilingual dataset with engine-gated blended thresholds plus an offline zh/en split recorded in BASELINES.md, a reused 24-doc corpus, LLM-generated-then-verified queries, and an observe-only off-corpus negatives set. Calibrate thresholds at observed - margin from one real-vector run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8-task plan: per-language metric splitter (TDD), corpus scaffold, LLM query generation via the MiniMax factory, curation/materialization, real-vector calibration with observed-margin thresholds, zh/en split into BASELINES.md, and the stacked PR. dikw-core read-only throughout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Recovers the zh/en breakdown (docs/dikw-eval-plan.md §2.4) from an eval NDJSON's per_query rows, since dataset.yaml thresholds are flat. Metric formulas mirror dikw-core/src/dikw_core/eval/metrics.py exactly so split_metrics(all) reconciles with the engine's blended doc metrics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…us copy) Both reuse synthetic-diverse-v2's 24-doc corpus (12 zh / 12 en). dataset.yaml thresholds left empty; domain-bilingual-v1 is calibrated to observed-margin after the real-vector run, negatives-ood-v1 stays observe-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…via factory domain-bilingual-v1: 34 verified positives (18 zh + 16 en) covering all 24 docs, with deliberate intra-cluster-confusable queries in the history clusters for ranking signal. negatives-ood-v1: 23 verified expect_none queries (11 zh + 12 en), plausible-but-uncovered. Generated through the MiniMax factory (scripts/generate_candidates.py --instruction steering) then human-verified gold. Fixes: - generate_candidates.py: add --instruction passthrough for targeted generation. - llm_client.py: raise output budget to 16000 tokens — MiniMax-M2.7 reasoning was exhausting the 4096 cap and truncating JSON mid-array (stop_reason max_tokens). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…margin Canonical doc/hybrid observed 1.0 (the 24-distinct-topic corpus saturates the vector/hybrid views). Gate at observed-margin: hit@k/mrr 0.95, ndcg/recall 0.97 — a regression-detector floor, not a discriminative benchmark. Denser confusable domain-bilingual-v2 noted as the discriminative follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tives-ood-v1) Blended per-mode table + zh/en split (saturated 1.0; bm25 mrr 0.985/ndcg 0.989 the only signal), domain-bilingual-v1 floor at observed-margin, negatives-ood-v1 observe-only diagnostics. Splitter reconciles with the engine within 1e-9. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements eval-plan datasets ii (domain-bilingual-v1) and iii (negatives-ood-v1), the first in-house Phase-1 retrieval sets. Design: docs/phase1-inhouse-datasets-design.md; plan: docs/superpowers/plans/.

What's here

- domain-bilingual-v1: 34 verified positives (18 zh + 16 en) over the reused synthetic-diverse-v2 24-doc corpus; every doc is covered, with intra-cluster-confusable history queries for ranking signal. Single dataset, blended engine gate; the zh/en split is recorded in reports/BASELINES.md (§2.4) via tools/split_metrics_by_lang.py.
- negatives-ood-v1: 23 verified expect_none queries (11 zh + 12 en), plausible-but-uncovered; observe-only (expect_none is diagnostic-only in dikw-core).
- tools/split_metrics_by_lang.py (+ tests): per-language metric split from an eval NDJSON; formulas mirror dikw-core's, and the split reconciles with the engine within 1e-9.
- --instruction steering for generate_candidates.py; raised the MiniMax output budget to 16000 tokens (reasoning was truncating candidate JSON at the 4096 cap).
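
The per-language split can be sketched roughly as follows. This is a minimal illustration, not the contents of tools/split_metrics_by_lang.py: the row fields (`query`, `expected`, `ranked`), the CJK-based language heuristic, and the function names are all assumptions, and only hit@k and MRR are shown, using their standard definitions rather than dikw-core's exact code.

```python
import json
from collections import defaultdict


def detect_lang(query: str) -> str:
    """Hypothetical heuristic: 'zh' if any CJK unified ideograph appears, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in query) else "en"


def reciprocal_rank(ranked: list[str], expected: set[str]) -> float:
    """1/rank of the first expected doc in the ranked list, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def split_by_lang(ndjson_text: str, k: int = 10) -> dict[str, dict[str, float]]:
    """Group per-query rows by language and average hit@k and MRR per group.

    Each NDJSON row is assumed (hypothetically) to look like:
    {"query": "...", "expected": ["doc-a"], "ranked": ["doc-b", "doc-a"]}
    """
    buckets: dict[str, list[dict]] = defaultdict(list)
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the NDJSON
        row = json.loads(line)
        buckets[detect_lang(row["query"])].append(row)
        buckets["all"].append(row)  # blended view over every row

    result: dict[str, dict[str, float]] = {}
    for lang, rows in buckets.items():
        rr = [reciprocal_rank(r["ranked"][:k], set(r["expected"])) for r in rows]
        result[lang] = {
            "n": float(len(rows)),
            "hit_at_k": sum(1.0 for v in rr if v > 0) / len(rows),
            "mrr": sum(rr) / len(rows),
        }
    return result
```

Because the `all` bucket averages over the same rows as the per-language buckets, the size-weighted mean of the zh and en values equals the blended value, which is the kind of reconciliation a 1e-9 tolerance check can assert.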

Calibration (real-vector, dikw-core v0.6.2, MiniMax+Gitee)

- domain-bilingual-v1: passed, exit 0. Saturates at 1.0 on vector/hybrid (24 distinct topics); bm25 mrr 0.985 / ndcg 0.989 is the only signal. Gate at observed - margin (hit@k/mrr 0.95, ndcg/recall 0.97): a regression-detector floor, not a discriminative benchmark. A denser domain-bilingual-v2 is the discriminative follow-up.
- negatives-ood-v1: passed, exit 0, observe-only.

Verification

- ruff + mypy src + pytest green; both datasets validate; eval-gate content check passes locally; dikw-core tracked tree clean (read-only held).

Merge order: #5 → #6 → #7 (GitHub auto-retargets on each merge).

🤖 Generated with Claude Code
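
For readers unfamiliar with the observed - margin gate used in the calibration, the arithmetic can be sketched as below. The metric key names, the function names, and the margin values (0.05 and 0.03, inferred here from the reported floors against an observed 1.0) are assumptions for illustration; the actual dataset.yaml schema and calibration script are not shown in this PR.

```python
def calibrate_floor(observed: dict[str, float], margins: dict[str, float]) -> dict[str, float]:
    """Set each threshold to observed - margin, clamped at 0 and rounded for stable YAML."""
    return {
        metric: round(max(0.0, value - margins[metric]), 4)
        for metric, value in observed.items()
    }


def passes_gate(run: dict[str, float], floor: dict[str, float]) -> bool:
    """A run passes when every gated metric is at or above its floor."""
    return all(run[metric] >= threshold for metric, threshold in floor.items())


# Observed values from one saturated real-vector run (hybrid view reported as 1.0).
observed = {"hit_at_k": 1.0, "mrr": 1.0, "ndcg": 1.0, "recall": 1.0}
# Assumed margins, chosen so the floors match the reported 0.95 / 0.97 values.
margins = {"hit_at_k": 0.05, "mrr": 0.05, "ndcg": 0.03, "recall": 0.03}

floor = calibrate_floor(observed, margins)
```

With a saturated baseline, a floor like this only catches regressions: any later run that drops a metric below its floor fails the gate, while runs at or near 1.0 keep passing, which is why the set is described as a regression detector rather than a discriminative benchmark.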